An Improved Analytical Superscalar Microprocessor Memory Model
Abstract
As the number of transistors that can be integrated onto a single chip continues to increase exponentially, modeling performance with reasonable accuracy in the early stages of processor design is a growing challenge. While methodologies for execution-driven simulation are well understood, comparatively little is known about how to develop accurate analytical models. Processor architects in industry have occasionally employed ad hoc analytical modeling techniques to rapidly focus the search for higher-performance designs. Moreover, analytical models can provide insights that a detailed performance simulator may not. This paper proposes techniques to accurately model the performance impact of long-latency data cache misses in a superscalar microprocessor. A pending data cache hit results from a memory reference to a cache block for which a request has already been initiated by another instruction but has not yet completed (i.e., the requested block is still on its way from memory). These pending cache hits have a non-negligible influence on the accuracy of analytical models when analyzing memory-intensive benchmarks. We propose a technique to quickly identify pending data cache hits and account for their effect on performance by analyzing memory reference patterns, without performing detailed performance simulations. We also propose a novel profiling method that accounts for the maximum number of outstanding cache misses supported by the memory system. Overall, these approaches improve performance prediction accuracy by a factor of 3.9 on average (error decreases from 39.7% to 10.3%) for a set of memory-intensive benchmarks when the maximum number of outstanding misses supported is unlimited. Moreover, our model achieves average speedups of 151 and 170 times over detailed simulation, with less than 10% error, when the maximum number of outstanding misses supported is sixteen and eight, respectively.
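To make the definition of a pending data cache hit concrete, the following is a minimal sketch that labels each reference in a memory trace as a miss, a pending hit, or an ordinary hit. It is not the paper's technique: the function name `classify_references`, the fixed `miss_latency`, the `block_size`, the per-reference cycle timestamps, and the assumption of a cache large enough that no block is ever evicted are all illustrative assumptions.

```python
def classify_references(trace, block_size=64, miss_latency=200):
    """Label each (cycle, address) reference as 'miss', 'pending_hit', or 'hit'.

    Assumptions (hypothetical, for illustration only):
      - trace is in program order with nondecreasing cycle numbers
      - every miss takes exactly miss_latency cycles to fill
      - the cache never evicts a block
    """
    ready_at = {}  # block number -> cycle at which its fill completes
    labels = []
    for cycle, address in trace:
        block = address // block_size
        if block not in ready_at:
            # First touch of this block: a request is sent to memory.
            ready_at[block] = cycle + miss_latency
            labels.append("miss")
        elif cycle < ready_at[block]:
            # The block was already requested but has not arrived yet.
            labels.append("pending_hit")
        else:
            # The block is resident in the cache.
            labels.append("hit")
    return labels


example_trace = [
    (0, 0x1000),    # first access to block A
    (10, 0x1008),   # same block A while its fill is in flight
    (50, 0x2000),   # first access to block B
    (220, 0x1010),  # block A after its fill has completed
    (240, 0x2004),  # block B while its fill is still in flight
]
print(classify_references(example_trace))
# -> ['miss', 'pending_hit', 'miss', 'hit', 'pending_hit']
```

In this toy trace, the second and fifth references would look like ordinary hits to a model that only tracks cache contents, yet both must wait for data still in flight, which is the effect the abstract says an analytical model needs to account for.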
Similar resources
An application specific multi-port RAM cell circuit for register renaming units in high speed microprocessors
We present a novel custom circuit for superscalar microprocessor renaming unit and compare its performance with a conventional design, referring to an industrial 0.35 μm CMOS process. Speed and power consumption are significantly improved.
Instruction-Level Microprocessor Modeling of Scientific Applications
Superscalar microprocessor efficiency is generally not as high as anticipated. In fact, sustained utilization below thirty percent of peak is not uncommon, even for fully optimized, cache-friendly codes. Where cycles are lost is the topic of much research. In this paper we attempt to model architectural effect on processor utilization with and without memory influence. By presenting analytical ...
A Split Data Cache for Superscalar Processors
Superscalar implementations of RISC architectures are emerging as the dominant high-performance microprocessor technology for the mid-1990’s. This paper proposes and evaluates a split data cache memory design, a new memory system enhancement for superscalar processor architectures. This design allows floating-point and integer memory accesses to be executed in parallel. The configuration is wel...
Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor
A new CMOS microprocessor, the Alpha 21164, reaches 1,200 mips/600 MFLOPS (peak performance). This new implementation of the Alpha architecture achieves SPECint92/SPECfp92 performance of 345/505 (estimated). At these performance levels, the Alpha 21164 has delivered the highest performance of any commercially available microprocessor in the world as of January 1995. It contains a quad-issue, su...
The Mips R10000 superscalar microprocessor
The Mips R10000 is a dynamic, superscalar microprocessor that implements the 64-bit Mips 4 instruction set architecture. It fetches and decodes four instructions per cycle and dynamically issues them to five fully-pipelined, low-latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instructions graduate in order upon completion. A...
Publication date: 2008